6 Applications in Computer Vision
6.1 Introduction
In this section, we introduce applications of binary neural networks in the field of computer
vision. Specifically, we cover the vision tasks of person re-identification, 3D point cloud
processing, and object detection, together with speech recognition. First, we briefly overview
these areas.
6.1.1 Person Re-Identification
A large family of person Re-ID research focuses on metric learning losses. Some works in-
troduce a verification loss [248] alongside the identification loss, while others apply a triplet
loss with hard sample mining [41, 203]. Recent efforts employ pedestrian attributes to provide
additional supervision for multi-task learning [213, 232]. One mainstream approach horizontally
splits input images or feature maps to take advantage of local spatial cues [132, 219, 271].
Similarly, pose estimation has been incorporated into the learning of local features [212, 214]. Fur-
thermore, human parsing is used in [111] to enhance spatial matching. In comparison,
DG-Net relies only on a simple identification loss for Re-ID learning and does not require
extra auxiliary information, such as pose or human parsing, for image generation.
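The triplet loss with hard sample mining mentioned above can be illustrated with a short sketch. The batch-hard variant below is a minimal NumPy illustration (the function name, margin value, and toy inputs are illustrative, not taken from the cited works): for each anchor it selects the farthest positive and the closest negative in the batch, then applies a hinge with a margin.

```python
import numpy as np

def batch_hard_triplet_loss(features, labels, margin=0.3):
    """Triplet loss with batch-hard mining (illustrative sketch).

    For each anchor, pick the farthest same-identity embedding
    (hardest positive) and the closest different-identity embedding
    (hardest negative), then apply a hinge with `margin`.
    """
    # Pairwise Euclidean distances between all embeddings in the batch.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)

    same = labels[:, None] == labels[None, :]        # same-identity mask
    pos_dist = np.where(same, dist, -np.inf).max(1)  # hardest positive
    neg_dist = np.where(~same, dist, np.inf).min(1)  # hardest negative

    return np.maximum(pos_dist - neg_dist + margin, 0.0).mean()
```

With a well-separated batch the hinge is inactive and the loss is zero; mining only the hardest pairs per anchor keeps the number of triplets linear in the batch size rather than cubic.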
Another active research line uses GANs [76] to augment training data. The work in [294]
first uses an unconditional GAN to generate images from random vectors. Huang et
al. proceed in this direction with WGAN [4] and assign pseudo-labels to the generated images
[95]. Li et al. propose sharing weights between the Re-ID model and the discriminator of the GAN [76].
In addition, some recent methods use pose estimation to generate pose-conditioned images.
In [103], a two-stage generation pipeline based on pose is developed to refine the generated
images. Similarly, pose is used in [71] to generate images of a pedestrian in different
poses, making the learned features more robust to pose variance.
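One common way to supervise a classifier on such GAN-generated pedestrians, in the spirit of the pseudo-labeling above, is to assign them a uniform distribution over all identities instead of a one-hot label. The sketch below is a hedged illustration of this idea (the function name and interface are assumptions, not the cited methods' actual code):

```python
import numpy as np

def pseudo_label_cross_entropy(logits, label, num_classes, is_generated):
    """Cross-entropy where generated images get a uniform pseudo-label.

    Real images use their one-hot identity label; GAN-generated images
    are assigned a uniform distribution over all identities, reflecting
    that they belong to no single real person.
    """
    # Log-softmax with max subtraction for numerical stability.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    if is_generated:
        target = np.full(num_classes, 1.0 / num_classes)
    else:
        target = np.zeros(num_classes)
        target[label] = 1.0
    return -(target * log_probs).sum()
```

The uniform target regularizes the classifier: generated samples push the model toward smoother decision boundaries rather than toward any particular identity.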
Meanwhile, some recent studies exploit synthetic data for style transfer of pedestrian
images to compensate for the disparity between source and target domains. CycleGAN
[300] is applied in [296] to transfer the style of pedestrian images from one dataset to
another. StarGAN [44] is used in [295] to generate pedestrian images in different camera
styles. Bak et al. [7] employ a game engine to render pedestrians under various illumination
conditions. Wei et al. [241] use semantic segmentation to extract the foreground mask and
assist style transfer.
6.1.2 3D Point Cloud Processing
PointNet [192] is the first deep learning model that directly processes point clouds. The ba-
sic building blocks proposed by PointNet, such as multi-layer perceptrons for point-wise
feature extraction and max/average pooling for global aggregation, have become popular
design choices for various categories of newer backbones. PointNet++ [193] exploits the met-
DOI: 10.1201/9781003376132-6
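The PointNet building blocks described above, a shared per-point MLP followed by symmetric pooling, can be sketched in a few lines. This is a minimal NumPy illustration (the weight shapes and function name are illustrative, not the original architecture): every point passes through the same MLP, and max pooling over the point dimension yields a global feature that is invariant to point ordering.

```python
import numpy as np

def pointnet_global_feature(points, w1, w2):
    """PointNet-style encoder sketch: shared per-point MLP + max pooling.

    points: (N, 3) xyz coordinates; w1, w2: shared MLP weight matrices.
    Because max pooling is a symmetric function, the output does not
    depend on the order in which the points are listed.
    """
    h = np.maximum(points @ w1, 0.0)  # shared MLP layer 1 + ReLU
    h = np.maximum(h @ w2, 0.0)       # shared MLP layer 2 + ReLU
    return h.max(axis=0)              # order-invariant global feature
```

Permuting the input rows leaves the output unchanged, which is exactly the set-function property PointNet relies on for unordered point clouds.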